Google: "text recovery" "from pdf" -password "rotated pdf files" "LZW encoded pdf documents"

http://www.adobe.com/support/techdocs/2d766.htm

http://www.verypdf.com/pdf2txt/pdf2txt.htm
Supports extracting text from encrypted PDF files.

Source Code Licensing
Includes the full source code and allows distribution within the same organization or royalty-free distribution within your applications. The source code and executables can be modified to suit your own needs. With the PDF2TXT source code you can:
1. Extract plain text from PDF files.
2. Parse the PDF file structure in a simple way: the output includes text and text coordinates but no image or graphic information, so it is easy to integrate into your program.
3. Extract the following information from PDF files (very useful for a search engine): title, subject, author, keywords, creator, producer, creation date, security, version, page count, etc.

PDF2TXT (pdf to text) v3.0 ($38)
PDF2TXT extracts text from PDF files; it does NOT need Adobe Acrobat. It processes files at very high speed and can convert multiple PDF files to text files at one time.

http://www.retsinasoftware.com/extract-convert-pdf-to-text.htm
PDF Plain Text Extractor (P2T) extracts plain text from PDF files without any PDF SDK or other third-party library. No Adobe product (neither Adobe Acrobat Reader nor Adobe Acrobat) needs to be installed on your computer. P2T focuses on text extraction: it analyzes the raw PDF file directly and extracts plain text from it. The layout of the document is preserved. Performance and precision are the stated goals. P2T supports PDF specification 1.x. A graphical user interface is also provided. The trial version processes only the first 5 pages of a PDF document, has no file size limit, and runs for 15 days.
http://www.download.com/PDF-Text-Viewer/3000-2079-10289153.html
http://www.foxitsoftware.com/pdf/tv_intro.php
PDF Text Viewer 2.2 (FREE)
Foxit's PDF Text Viewer can view and extract text information from PDF documents. You also can use it as a free, lightweight, PDF viewer with printing support. Unlike Acrobat Reader or other tools, PDF Text Viewer extracts text information in a real, readable format for most PDF documents. You can directly read, print, or archive the converted result without much editing, or copy the selected portion to any other application. The program also can automatically convert all text information in a PDF file into a text file.

For CREATING a PDF, OpenOffice does a pretty good job and it's free.

=========================================================

Jay Carlson:

On to PDF. The core of PDF is basically PostScript except you can't define your own procedures; you have to use the ones defined by Adobe. Looking at ComputerForensicsEtc.pdf, we see stuff like:

3 0 obj<>stream
[2656 octets]
endstream

which is a zlib ("deflate") compressed stream. Peering inside it, we see:

=====
/GS1 gs
BT
/TT2 1 Tf
12 0 0 12 72 708.84 Tm
/Cs6 cs 0 0 0 scn
-0.002 Tc
0.002 Tw
[(T)-5(i)-4.2(tl)-4.2(e: Com)11(p)-5.8(u)-5.8(t)1(er F)8.8(o)-2(ren)-5.8(si)-4.2(cs - Com)11(p)-5.8(u)-5.8(t)1(er Use P)8.8(o)-2(l)-4.2(i)-4.2(c)1.8(y Revi)-4.2(ew)-9.8(s i)-4.2(n)-5.8( Cl)-4.2(assi)-4.2(f)-9(i)-4.2(ed)-5.8( Agen)-5.8(ci)-4.2(es)]TJ
[...]
=====

which should look very familiar to PostScript hackers. That last part is strings (in parentheses) alternating with spacing adjustments. Using a single "TJ" operator to write a line of text is only an optimization; the PDF file could also contain a huge pile of single-letter draws all over the page in random order. PDF is a very unfriendly format for naive dirty word searches.
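To make the TJ structure concrete, here is a rough Python sketch (not part of the original post) that inflates a deflate-compressed stream and glues together the parenthesized strings of each TJ array, ignoring the spacing numbers. The names inflate_stream, tj_strings and SAMPLE are made up for illustration. It assumes the string bytes map directly to Latin-1 characters, which holds for the stream quoted above but not for PDFs with custom font encodings, and it only handles the escapes \( \) and \\.

```python
import re
import zlib

# One literal string operand: ( ... ), allowing backslash-escaped characters.
# Unescaped nested parentheses are not handled in this sketch.
_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
# One array operand immediately followed by the TJ operator.
# A literal "]" inside a string would confuse this simple pattern.
_TJ_ARRAY = re.compile(rb"\[((?:\\.|[^\\\]])*)\]\s*TJ", re.DOTALL)


def inflate_stream(raw: bytes) -> bytes:
    """Inflate the bytes found between 'stream' and 'endstream'."""
    return zlib.decompress(raw)


def tj_strings(content: bytes) -> list[str]:
    """Return one string per TJ array, dropping the spacing adjustments."""
    lines = []
    for array in _TJ_ARRAY.finditer(content):
        pieces = []
        for lit in _LITERAL.finditer(array.group(1)):
            # Undo the three simplest escapes: \( \) and \\
            text = re.sub(rb"\\([()\\])", rb"\1", lit.group(1))
            # Assumption: bytes are plain Latin-1 text, not font-specific codes.
            pieces.append(text.decode("latin-1"))
        lines.append("".join(pieces))
    return lines


# Shortened fragment of the content stream quoted above.
SAMPLE = (
    b"BT /TT2 1 Tf [(T)-5(i)-4.2(tl)-4.2(e: Com)11(p)-5.8(u)-5.8(t)1(er F)"
    b"8.8(o)-2(ren)-5.8(si)-4.2(cs)]TJ ET"
)

print(tj_strings(SAMPLE))
```

On the shortened sample this prints ['Title: Computer Forensics']. The sketch does nothing about the worse case described above, where text is drawn one letter at a time in arbitrary order; recovering reading order there needs the text positions as well.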
Also, because PDF incorporates compression, a physical block-by-block search of a hard drive won't work well; missing the beginning of a compressed hunk makes it very difficult to make sense of anything after it. (And, the better the compression algorithm, the harder it is to start in the middle.)

For casual use, I'd use xpdf's pdftotext utility. There's pstotext too, but it uses Ghostscript, and I dunno if I want to feed potentially hostile documents to it. I don't think I'd want to do the certification paperwork on those kinds of external tools.

BTW, for creating PDF files on Windows, check out the GPL'd PDFCreator.
http://sourceforge.net/projects/pdfcreator/
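
Returning to the point about compression defeating a raw block-by-block search: the following Python sketch (an illustration added here, not from the original post) compresses a made-up content stream with zlib and checks where a search word can still be found. The stream text, the word "codename", the 10-byte offset and the helper names are all hypothetical.

```python
import zlib

# Hypothetical content stream containing a word we want to find.
NEEDLE = b"codename"
plain = b"BT /TT2 1 Tf (secret project codename) Tj ET\n" * 20
packed = zlib.compress(plain)


def try_inflate(blob: bytes):
    """Return the inflated bytes, or None if zlib rejects the data."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None


def search_report(needle: bytes) -> dict:
    """Report where the needle is visible: raw, compressed, inflated."""
    # Simulate a search that lands 10 bytes into the compressed hunk.
    from_middle = try_inflate(packed[10:])
    return {
        "raw_plain": needle in plain,
        "raw_packed": needle in packed,
        "inflated": needle in (try_inflate(packed) or b""),
        "from_middle": needle in (from_middle or b""),
    }


print(search_report(NEEDLE))
```

The word is visible in the uncompressed bytes and after a full inflate, but a byte search of the compressed data misses it, and inflating from an offset inside the stream does not recover it either.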